bioRxiv preprint doi: https://doi.org/10.1101/2020.01.30.927871; this version posted January 31, 2020. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.


Uncanny similarity of unique inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag

Prashant Pradhan$1,2, Ashutosh Kumar Pandey$1, Akhilesh Mishra$1, Parul Gupta1, Praveen
Kumar Tripathi1, Manoj Balakrishnan Menon1, James Gomes1, Perumal Vivekanandan*1 and
Bishwajit Kundu*1

1 Kusuma School of Biological Sciences, Indian Institute of Technology, New Delhi-110016, India.
2 Acharya Narendra Dev College, University of Delhi, New Delhi-110019, India


*  Corresponding  authors-  email:  bkundu@bii 


Abstract: 


We are currently witnessing a major epidemic caused by the 2019 novel coronavirus (2019-nCoV).
The evolution of 2019-nCoV remains elusive. We found 4 insertions in the spike
glycoprotein (S) which are unique to the 2019-nCoV and are not present in other coronaviruses.
Importantly, amino acid residues in all the 4 inserts have identity or similarity to those in the
HIV-1 gp120 or HIV-1 Gag. Interestingly, despite the inserts being discontinuous on the primary
amino acid sequence, 3D-modelling of the 2019-nCoV suggests that they converge to constitute
the receptor binding site. The finding of 4 unique inserts in the 2019-nCoV, all of which have
identity/similarity to amino acid residues in key structural proteins of HIV-1, is unlikely to be
fortuitous in nature. This work provides yet unknown insights on 2019-nCoV and sheds light on
the evolution and pathogenicity of this virus with important implications for diagnosis of this virus.

Introduction 

Coronaviruses (CoV) are single-stranded positive-sense RNA viruses that infect animals and
humans. These are classified into 4 genera based on their host specificity: Alphacoronavirus,
Betacoronavirus, Deltacoronavirus and Gammacoronavirus (Snijder et al., 2006). There are seven
known types of CoVs that include 229E and NL63 (Genus Alphacoronavirus), OC43, HKU1,
MERS and SARS (Genus Betacoronavirus). While 229E, NL63, OC43, and HKU1 commonly
infect humans, the SARS and MERS outbreaks in 2002 and 2012 respectively occurred when the
virus crossed over from animals to humans, causing significant mortality (J. Chan et al., n.d.; J. F.
W. Chan et al., 2015). In December 2019, another outbreak of coronavirus was reported from
Wuhan, China that was also transmitted from animals to humans. This new virus has been temporarily
termed 2019-novel Coronavirus (2019-nCoV) by the World Health Organization (WHO) (J. F.-W.
Chan et al., 2020; Zhu et al., 2020). While there are several hypotheses about the origin of
2019-nCoV, the source of this ongoing outbreak remains elusive.

The transmission patterns of 2019-nCoV are similar to those documented in the
previous outbreaks, including transmission by bodily or aerosol contact with persons infected with the virus.
Cases of mild to severe illness, and death from the infection, have been reported from Wuhan. This
outbreak has spread rapidly to distant nations including France, Australia and USA among others.
The number of cases within and outside China is increasing steeply. Our current understanding
is limited to the virus genome sequences and modest epidemiological and clinical data.
Comprehensive analysis of the available 2019-nCoV sequences may provide important clues that
may help advance our current understanding to manage the ongoing outbreak.

The spike glycoprotein (S) of coronavirus is cleaved into two subunits (S1 and S2). The S1
subunit helps in receptor binding and the S2 subunit facilitates membrane fusion (Bosch et al.,
2003; Li, 2016). The spike glycoproteins of coronaviruses are important determinants of tissue
tropism and host range. In addition, the spike glycoproteins are critical targets for vaccine
development (Du et al., 2013). For this reason, the spike proteins are the most extensively
studied proteins among coronaviruses. We therefore sought to investigate the spike glycoprotein of the
2019-nCoV to understand its evolution and its novel sequence and structural features using
computational tools.


Methodology 


Retrieval  and  alignment  of  nucleic  acid  and  protein  sequences 

We retrieved all the available coronavirus sequences (n=55) from the NCBI viral genome database
(https://www.ncbi.nlm.nih.gov/) and we used GISAID (Elbe & Buckland-Merrett,
2017) [https://www.gisaid.org/] to retrieve all available full-length sequences (n=28) of
2019-nCoV as on 27 Jan 2020. Multiple sequence alignment of all coronavirus genomes was performed
using MUSCLE software (Edgar, 2004) based on the neighbour joining method. Out of the 55
coronavirus genomes, 32 representative genomes of all categories were used for phylogenetic tree
development using MEGA X software (Kumar et al., 2018). The closest relative was found to be
SARS CoV. The glycoprotein regions of SARS CoV and 2019-nCoV were aligned and visualized
using Multalin software (Corpet, 1988). The identified amino acid and nucleotide sequences were
aligned with the whole viral genome database using BLASTp and BLASTn. The conservation of the
nucleotide and amino acid motifs in 28 clinical variants of the 2019-nCoV genome was presented by
performing multiple sequence alignment using MEGA X software. The three-dimensional structure
of the 2019-nCoV glycoprotein was generated using the SWISS-MODEL online server (Biasini et al.,
2014) and the structure was marked and visualized using PyMol (DeLano, 2002).
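
The comparison step described above can be illustrated with a small sketch. It assumes that two rows of an already computed pairwise alignment are available as gapped strings (with "-" as the gap character) and reports every run where the query has residues opposite gaps in the reference. The function name `find_insertions`, the `min_len` threshold and the toy sequences are hypothetical and are not taken from the study; only the motif HKNNKS comes from the text.

```python
from typing import List, Tuple


def find_insertions(query_aln: str, ref_aln: str, min_len: int = 4) -> List[Tuple[int, str]]:
    """Return (1-based start in the ungapped query, inserted residues) for each
    run where the reference row is gapped and the query row carries residues."""
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned rows must have equal length")
    insertions: List[Tuple[int, str]] = []
    query_pos = 0          # number of query residues consumed so far
    run_start = None       # query position where the current run began
    run: List[str] = []
    for q, r in zip(query_aln, ref_aln):
        if q != "-":
            query_pos += 1
        if r == "-" and q != "-":
            if run_start is None:
                run_start = query_pos
            run.append(q)
        else:
            # The run (if any) ended; keep it only if it is long enough.
            if len(run) >= min_len:
                insertions.append((run_start, "".join(run)))
            run_start, run = None, []
    if len(run) >= min_len:
        insertions.append((run_start, "".join(run)))
    return insertions


# Toy rows (invented flanking residues around a motif named in the text).
print(find_insertions("MKTAYHKNNKSIAK", "MKTAY------IAK"))
```

Positions reported this way depend on the alignment that was supplied, so a different aligner or gap penalty may shift or split a run.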


Results 


Uncanny similarity of novel inserts in the 2019-nCoV spike protein to HIV-1 gp120 and Gag

Our phylogenetic tree of full-length coronaviruses suggests that 2019-nCoV is closely related to
SARS CoV [Fig. 1]. In addition, other recent studies have linked the 2019-nCoV to SARS CoV.
We therefore compared the spike glycoprotein sequence of the 2019-nCoV to that of the SARS
CoV (NCBI Accession number: AY390556.1). On careful examination of the sequence
alignment we found that the 2019-nCoV spike glycoprotein contains 4 insertions [Fig. 2]. To
further investigate if these inserts are present in any other coronavirus, we performed a multiple
sequence alignment of the spike glycoprotein amino acid sequences of all available
coronaviruses (n=55) [refer Table S.File1] in NCBI RefSeq (ncbi.nlm.nih.gov); this includes one
sequence of 2019-nCoV [Fig. S1]. We found that these 4 insertions [inserts 1, 2, 3 and 4] are
unique to 2019-nCoV and are not present in other coronaviruses analyzed. Another group from
China had documented three insertions comparing fewer spike glycoprotein sequences of
coronaviruses.


Figure 1: Maximum likelihood genealogy showing the evolution of 2019-nCoV. The evolutionary history
was inferred by using the Maximum Likelihood method and JTT matrix-based model. The tree
with the highest log likelihood (12458.88) is shown. Initial tree(s) for the heuristic search were
obtained automatically by applying Neighbor-Join and BioNJ algorithms to a matrix of pairwise
distances estimated using a JTT model, and then selecting the topology with superior log likelihood
value. This analysis involved 5 amino acid sequences. There were a total of 1387 positions in the
final dataset. Evolutionary analyses were conducted in MEGA X.


[Figure 2: alignment image; rows 2019-nCoV, SARS-GZ02 and Consensus; residue positions 1 to 1277; Insert 1, Insert 2, Insert 3 and Insert 4 marked.]

Figure 2: Multiple sequence alignment between spike proteins of 2019-nCoV and SARS. The
sequences of spike proteins of 2019-nCoV (Wuhan-HU-1, Accession NC_045512) and of SARS
CoV (GZ02, Accession AY390556) were aligned using Multalin software. The sites of difference
are highlighted in boxes.

We then analyzed all available full-length sequences (n=28) of 2019-nCoV in GISAID (Elbe &
Buckland-Merrett, 2017) as on January 27, 2020 for the presence of these inserts. As most of these
sequences are not annotated, we compared the nucleotide sequences of the spike glycoprotein of
all available 2019-nCoV sequences using BLASTp. Interestingly, all the 4 insertions were
absolutely (100%) conserved in all the available 2019-nCoV sequences analyzed [Fig. S2, Fig. S3].
We then translated the aligned genome and found that these inserts are present in all Wuhan
2019-nCoV viruses except the 2019-nCoV virus with bat as a host [Fig. S4]. Intrigued by the 4 highly
conserved inserts unique to 2019-nCoV we wanted to understand their origin. For this purpose,
we performed a local alignment with each 2019-nCoV insert as query against all virus genomes and
considered hits with 100% sequence coverage. Surprisingly, each of the four inserts aligned with
short segments of the Human Immunodeficiency Virus-1 (HIV-1) proteins. The amino acid
positions of the inserts in 2019-nCoV and the corresponding residues in HIV-1 gp120 and HIV-1
Gag are shown in Table 1. The first 3 inserts (inserts 1, 2 and 3) aligned to short segments of amino
acid residues in HIV-1 gp120. Insert 4 aligned to HIV-1 Gag. Insert 1 (6 amino acid
residues) and insert 2 (6 amino acid residues) in the spike glycoprotein of 2019-nCoV are 100%
identical to the residues mapped to HIV-1 gp120. Insert 3 (12 amino acid residues) in
2019-nCoV maps to HIV-1 gp120 with gaps [see Table 1]. Insert 4 (8 amino acid residues) maps to
HIV-1 Gag with gaps.
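
A conservation check of the kind reported above can be sketched as a substring search over translated spike sequences. This is only an illustration under stated assumptions: the ungapped motif strings are taken from Table 1, while the function name `insert_conservation`, the dictionary layout and the toy isolate sequences are hypothetical and stand in for the GISAID data, which are not reproduced here.

```python
from typing import Dict

# Ungapped 2019-nCoV insert motifs as listed in Table 1.
INSERTS: Dict[str, str] = {
    "insert 1": "TNGTKR",
    "insert 2": "HKNNKS",
    "insert 3": "RSYLTPGDSSSG",
    "insert 4": "QTNSPRRA",
}


def insert_conservation(sequences: Dict[str, str], inserts: Dict[str, str]) -> Dict[str, float]:
    """Fraction of sequences that contain each insert as an exact substring."""
    if not sequences:
        raise ValueError("at least one sequence is required")
    total = len(sequences)
    return {
        name: sum(motif in seq for seq in sequences.values()) / total
        for name, motif in inserts.items()
    }


# Toy protein fragments (invented); real input would be translated spike sequences.
toy_isolates = {
    "isolate_A": "AAATNGTKRAAAHKNNKS",
    "isolate_B": "CCTNGTKRCC",
}
for name, fraction in insert_conservation(toy_isolates, INSERTS).items():
    print(f"{name}: present in {fraction:.0%} of sequences")
```

An exact substring test is stricter than an alignment-based comparison: a single substitution inside a motif counts as absence.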

Although the 4 inserts represent discontiguous short stretches of amino acids in the spike glycoprotein
of 2019-nCoV, the fact that all four of them share amino acid identity or similarity with HIV-1
gp120 and HIV-1 Gag (among all annotated virus proteins) suggests that this is not a random
fortuitous finding. In other words, one may sporadically expect a fortuitous match for a stretch of
6-12 contiguous amino acid residues in an unrelated protein. However, it is unlikely that all 4
inserts in the 2019-nCoV spike glycoprotein fortuitously match with 2 key structural proteins of
an unrelated virus (HIV-1).
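
The frequency of fortuitous matches mentioned above can be put on a rough numerical footing. The sketch below estimates the expected number of exact chance occurrences of one fixed motif in a database, assuming independent and uniformly distributed residues over a 20-letter alphabet. Both the uniform model and the database size of 10^8 residues are assumptions made for illustration; neither is a figure from the study, and the function name is hypothetical.

```python
def expected_exact_matches(motif_len: int, db_residues: int, alphabet: int = 20) -> float:
    """Expected number of exact chance occurrences of one fixed motif of length
    motif_len among all windows of a database holding db_residues residues,
    assuming independent, uniformly distributed residues."""
    if motif_len < 1 or db_residues < motif_len:
        raise ValueError("need 1 <= motif_len <= db_residues")
    windows = db_residues - motif_len + 1          # start positions (ignores sequence boundaries)
    return windows * (1.0 / alphabet) ** motif_len


HYPOTHETICAL_DB_RESIDUES = 10**8                   # assumed database size, not from the study
for k in (6, 8, 12):
    print(k, expected_exact_matches(k, HYPOTHETICAL_DB_RESIDUES))
```

Under these assumptions a 6-residue motif is expected to occur about 1.6 times by chance, while the expectation falls by a factor of 20 for every additional residue. Real proteins have skewed residue composition, and gapped or partial matches are more frequent than exact ones, so the estimate is only an order-of-magnitude guide.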




The amino acid residues of inserts 1, 2 and 3 of 2019-nCoV spike glycoprotein that mapped to
HIV-1 were a part of the V4, V5 and V1 domains respectively in gp120 [Table 1]. Since the
2019-nCoV inserts mapped to variable regions of HIV-1, they were not ubiquitous in HIV-1 gp120, but
were limited to selected sequences of HIV-1 [refer S.File1] primarily from Asia and Africa.


The HIV-1 Gag protein enables interaction of the virus with the negatively charged host surface
(Murakami, 2008) and a high positive charge on the Gag protein is a key feature for the host-virus
interaction. On analyzing the pI values for each of the 4 inserts in 2019-nCoV and the
corresponding stretches of amino acid residues from HIV-1 proteins we found that a) the pI values
were very similar for each pair analyzed and b) most of these pI values were 10±2 [refer Table 1]. Of
note, despite the gaps in inserts 3 and 4 the pI values were comparable. This uniformity in the pI
values for all the 4 inserts merits further investigation.
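
For readers unfamiliar with how a pI value is obtained, the following sketch computes the pH at which the net charge of a peptide is zero, using the Henderson-Hasselbalch relation for each ionizable group and a bisection search. The pKa values are an illustrative textbook-style set chosen for this example and are an assumption; the tool and pKa scale used for Table 1 are not stated in the text, so the numbers produced here need not match the table.

```python
# Illustrative pKa values (assumption; published pKa scales differ).
PKA_POSITIVE = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"Cterm": 2.0, "D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}


def net_charge(seq: str, ph: float) -> float:
    """Net charge of a peptide with free termini at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE["Nterm"]))    # N-terminal amine
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE["Cterm"] - ph))   # C-terminal carboxyl
    for aa in seq.upper():
        if aa in ("K", "R", "H"):
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in ("D", "E", "C", "Y"):
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(seq: str, tol: float = 1e-4) -> float:
    """pH at which net_charge is zero, found by bisection on [0, 14].
    net_charge decreases monotonically with pH, so bisection is valid."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


for motif in ("TNGTKR", "HKNNKS", "RSYLTPGDSSSG", "QTNSPRRA"):
    print(motif, round(isoelectric_point(motif), 2))
```

For peptides this short the terminal groups contribute a large share of the charge, so the result is sensitive to whether free termini are assumed, as they are here.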

As none of these 4 inserts are present in any other coronavirus, the genomic regions encoding these
inserts represent ideal candidates for designing primers that can distinguish 2019-nCoV from other
coronaviruses.

Insert  Virus protein       Positions  Aligned sequence     HIV protein/region  Source country/subtype  Polar residues  Charge  pI value
1       2019-nCoV (GP)      71-76      TNGTKR               -                   -                       (illegible)     2       11
        HIV-1 (gp120)       404-409    TNGTKR               gp120-V4            Thailand*/CRF01_AE      (illegible)     2       11
2       2019-nCoV (GP)      145-150    HKNNKS               -                   -                       6               2       10
        HIV-1 (gp120)       462-467    HKNNKS               gp120-V5            Kenya*/(illegible)      6               2       10
3       2019-nCoV (GP)      245-256    RSYL - TPGDSSSG      -                   -                       8               2       10.84
        HIV-1 (gp120)       136-150    RTYLFNETRGNSSSG      gp120-V1            India*/C                10              1       8.75
4       2019-nCoV (Poly P)  684        QTNS . PRRA          -                   -                       6               2       12.00
        HIV-1 (gag)         366-384    QTNSSILMQRSNFKGPRRA  Gag                 India*/C                12              4       12.30

Table 1: Aligned sequences of 2019-nCoV and gp120 protein of HIV-1 with their positions
in the primary sequence of the protein. All the inserts have a high density of positively charged
residues. The deleted fragments in inserts 3 and 4 increase the positive charge to surface area
ratio. * Please see Supp. Table 1 for accession numbers.


The  novel  inserts  are  part  of  the  receptor  binding  site  of  2019-nCoV 

To get structural insights and to understand the role of these insertions in 2019-nCoV glycoprotein,
we modelled its structure based on the available structure of SARS spike glycoprotein (PDB:
6ACD.1.A). The comparison of the modelled structure reveals that although inserts 1, 2 and 3 are
at non-contiguous locations in the protein primary sequence, they fold to constitute the part of the
glycoprotein binding site that recognizes the host receptor (Kirchdoerfer et al., 2016) (Figure 3).
Insert 1 corresponds to the NTD (N-terminal domain) and inserts 2 and 3 correspond to
the CTD (C-terminal domain) of the S1 subunit in the 2019-nCoV spike glycoprotein. Insert
4 is at the junction of the SD1 (sub domain 1) and SD2 (sub domain 2) of the S1 subunit (Ou et
al., 2017). We speculate that these insertions provide additional flexibility to the glycoprotein
binding site by forming a hydrophilic loop in the protein structure that may facilitate or enhance
virus-host interactions.

Insert 1: TNGTKR; Insert 2: HKNNKS; Insert 3: RSYL - TPGDSSSG; Insert 4: QTNSPRRA

Figure 3. Modelled homo-trimer spike glycoprotein of 2019-nCoV virus. The inserts from HIV
envelope protein are shown with colored beads, present at the binding site of the protein.

Evolutionary  Analysis  of  2019-nCoV 


It has been speculated that 2019-nCoV is a variant of coronavirus derived from an animal source
which got transmitted to humans. Considering the change of specificity for host, we decided to
study the sequences of the spike glycoprotein (S protein) of the virus. S proteins are surface proteins
that help the virus in host recognition and attachment. Thus, a change in these proteins can be
reflected as a change of host specificity of the virus. To know the alterations in the S protein gene of
2019-nCoV and their consequences for structural re-arrangements we performed in silico analysis of
2019-nCoV with respect to all other viruses. A multiple sequence alignment between the S protein
amino acid sequences of 2019-nCoV, Bat-SARS-Like, SARS-GZ02 and MERS revealed that the S
protein has evolved with closest significant diversity from the SARS-GZ02 (Figure 1).

Insertions  in  Spike  protein  region  of  2019-nCoV 

Since the S protein of 2019-nCoV shares closest ancestry with SARS GZ02, the sequences coding
for the spike proteins of these two viruses were compared using Multalin software. We found four
new insertions in the protein of 2019-nCoV: “GTNGTKR” (IS1), “HKNNKS” (IS2), “GDSSSG”
(IS3) and “QTNSPRRA” (IS4) (Figure 2). To our surprise, these sequence insertions were not only
absent in the S protein of SARS but were also not observed in any other member of the Coronaviridae
family (Supplementary figure). This is startling as it is quite unlikely for a virus to have acquired
such unique insertions naturally in a short duration of time.

Insertions share similarity to HIV


The insertions were observed to be present in all the genomic sequences of 2019-nCoV virus
available from the recent clinical isolates (Supplementary Figure 1). To know the source of these
insertions in 2019-nCoV a local alignment was done with BLASTp using these insertions as query
against all virus genomes. Unexpectedly, all the insertions aligned with Human Immunodeficiency
Virus-1 (HIV-1). Further analysis revealed that the aligned sequences of HIV-1 were
derived from the surface glycoprotein gp120 (amino acid sequence positions: 404-409, 462-467,
136-150) and from the Gag protein (amino acids 366-384) (Table 1). The Gag protein of HIV is involved in host
membrane binding, packaging of the virus and the formation of virus-like particles. gp120
plays a crucial role in recognizing the host cell by binding to the primary receptor CD4. This binding
induces structural rearrangements in gp120, creating a high affinity binding site for a chemokine
co-receptor like CXCR4 and/or CCR5.
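
The idea behind the local alignment step can be shown with a minimal Smith-Waterman scorer. This is a simplified stand-in and not BLASTp itself: BLASTp uses a substitution matrix, affine gap penalties, a seeding heuristic and E-value statistics, whereas the sketch below uses identity scoring with a linear gap penalty. The scoring parameters and the function name are assumptions chosen for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b using identity scoring
    and a linear gap penalty (Smith-Waterman dynamic programming)."""
    prev = [0] * (len(b) + 1)      # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Query and subject strings as listed in Table 1 for insert 4.
print(smith_waterman_score("QTNSPRRA", "QTNSSILMQRSNFKGPRRA"))
```

A raw score of this kind says nothing about significance on its own; that requires comparing it with the score distribution expected for a database of the size searched.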

Discussion 


The current outbreak of 2019-nCoV warrants a thorough investigation and understanding of its
ability to infect human beings. Keeping in mind that there has been a clear change in the preference
of host from previous coronaviruses to this virus, we studied the change in spike protein between
2019-nCoV and other viruses. We found four new insertions in the S protein of 2019-nCoV when
compared to its nearest relative, SARS CoV. The genome sequences from the recent 28 clinical
isolates showed that the sequences coding for these insertions are conserved amongst all these
isolates. This indicates that these insertions have been preferably acquired by the 2019-nCoV,
providing it with additional survival and infectivity advantage. Delving deeper we found that these
insertions were similar to HIV-1. Our results highlight an astonishing relation between the gp120
and Gag proteins of HIV and the 2019-nCoV spike glycoprotein. These proteins are critical for the
viruses to identify and latch on to their host cells and for viral assembly (Beniac et al., 2006).
Since surface proteins are responsible for host tropism, changes in these proteins imply a change
in host specificity of the virus. According to reports from China, there has been a gain of host
specificity in the case of 2019-nCoV, as the virus was originally known to infect animals and not humans
but, after the mutations, it has gained tropism to humans as well.


Moving ahead, 3D modelling of the protein structure showed that these insertions are present at
the binding site of 2019-nCoV. Due to the presence of gp120 motifs in the 2019-nCoV spike
glycoprotein at its binding domain, we propose that these motif insertions could have provided an
enhanced affinity towards host cell receptors. Further, this structural change might have also
increased the range of host cells that 2019-nCoV can infect. To the best of our knowledge, the
function of these motifs is still not clear in HIV and needs to be explored. The exchange of genetic
material among viruses is well known and such critical exchange highlights the risk and the
need to investigate the relations between seemingly unrelated virus families.

Conclusions 


Our analysis of the spike glycoprotein of 2019-nCoV revealed several interesting findings: First,
we identified 4 unique inserts in the 2019-nCoV spike glycoprotein that are not present in any
other coronavirus reported till date. To our surprise, all the 4 inserts in the 2019-nCoV mapped to
short segments of amino acids in the HIV-1 gp120 and Gag among all annotated virus proteins in
the NCBI database. This uncanny similarity of novel inserts in the 2019-nCoV spike protein to
HIV-1 gp120 and Gag is unlikely to be fortuitous. Further, 3D modelling suggests that at least 3 of
the unique inserts, which are non-contiguous in the primary protein sequence of the 2019-nCoV
spike glycoprotein, converge to constitute the key components of the receptor binding site. Of note,
all the 4 inserts have pI values of around 10 that may facilitate virus-host interactions. Taken
together, our findings suggest unconventional evolution of 2019-nCoV that warrants further
investigation. Our work highlights novel evolutionary aspects of the 2019-nCoV and has
implications for the pathogenesis and diagnosis of this virus.

[Fig. S1: alignment image; spike glycoproteins of 55 coronavirus reference sequences (NCBI RefSeq accessions, rows 1 to 55) across the regions labelled INSERT 1, INSERT 2, INSERT 3 and INSERT 4.]

Fig. S1: Multiple sequence alignment of glycoprotein of Coronaviridae family, representing all the
four inserts.

Fig. S2: All four inserts are present in the aligned 28 Wuhan 2019-nCoV virus genomes obtained
from GISAID. The gap in the Bat-SARS Like CoV in the last row shows that inserts 1 and 4 are
unique to Wuhan 2019-nCoV.

Fig. S3: Phylogenetic tree of 28 clinical isolate genomes of 2019-nCoV including one from bat as a host.

Supplementary Fig. 4: Genome alignment of Coronaviridae family. Highlighted black sequences are the
inserts represented here.


